{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Основы скрапинга и парсинга\n",
    "*[По материалам](https://proglib.io/p/samouchitel-po-python-dlya-nachinayushchih-chast-17-osnovy-skrapinga-i-parsinga-2023-03-13)*\n",
    "### Скрапинг содержимого страницы\n",
    "Воспользуемся модулем urllib.request стандартной библиотеки urllib для получения исходного кода ОДНОСТРАНИЧНОГО САЙТА example.com"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from urllib.request import urlopen\n",
    "url = 'http://example.com'\n",
    "page = urlopen(url)\n",
    "print(page.read().decode('utf-8'))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Точно такой же результат можно получить с помощью requests:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "<!DOCTYPE html>\n",
      "\n",
      "<html>\n",
      "<head>\n",
      "<title>Example Domain</title>\n",
      "<meta charset=\"utf-8\"/>\n",
      "<meta content=\"text/html; charset=utf-8\" http-equiv=\"Content-type\"/>\n",
      "<meta content=\"width=device-width, initial-scale=1\" name=\"viewport\"/>\n",
      "<style type=\"text/css\">\n",
      "    body {\n",
      "        background-color: #f0f0f2;\n",
      "        margin: 0;\n",
      "        padding: 0;\n",
      "        font-family: -apple-system, system-ui, BlinkMacSystemFont, \"Segoe UI\", \"Open Sans\", \"Helvetica Neue\", Helvetica, Arial, sans-serif;\n",
      "        \n",
      "    }\n",
      "    div {\n",
      "        width: 600px;\n",
      "        margin: 5em auto;\n",
      "        padding: 2em;\n",
      "        background-color: #fdfdff;\n",
      "        border-radius: 0.5em;\n",
      "        box-shadow: 2px 3px 7px 2px rgba(0,0,0,0.02);\n",
      "    }\n",
      "    a:link, a:visited {\n",
      "        color: #38488f;\n",
      "        text-decoration: none;\n",
      "    }\n",
      "    @media (max-width: 700px) {\n",
      "        div {\n",
      "            margin: 0 auto;\n",
      "            width: auto;\n",
      "        }\n",
      "    }\n",
      "    </style>\n",
      "</head>\n",
      "<body>\n",
      "<div>\n",
      "<h1>Example Domain</h1>\n",
      "<p>This domain is for use in illustrative examples in documents. You may use this\n",
      "    domain in literature without prior coordination or asking for permission.</p>\n",
      "<p><a href=\"https://www.iana.org/domains/example\">More information...</a></p>\n",
      "</div>\n",
      "</body>\n",
      "</html>\n",
      "\n"
     ]
    }
   ],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "import requests\n",
    "url = 'http://example.com'\n",
    "res = requests.get(url)\n",
    "soup = BeautifulSoup(res.text, 'html.parser')\n",
    "print(soup)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Парсинг полученных данных\n",
    "### Способы извлечения данных:\n",
    "- Метод строк\n",
    "- Регулярне выражения\n",
    "- Запрос XPath\n",
    "- Обработка BeautifulSoup\n",
    "\n",
    "ИЗВЛЕЧЕМ АДРЕС ССЫЛКИ\n",
    "### Метод строк (самый трудоемкий)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.iana.org/domains/example\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "url = 'http://example.com/'\n",
    "page = urlopen(url)\n",
    "html_code = page.read().decode('utf-8')\n",
    "start = html_code.find('href=\"') + 6\n",
    "end = html_code.find('\">More')\n",
    "link = html_code[start:end]\n",
    "print(link)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Применение регулярных выражений"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "['https://www.iana.org/domains/example']\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "import re\n",
    "url = 'http://example.com/'\n",
    "page = urlopen(url)\n",
    "html_code = page.read().decode('utf-8')\n",
    "link = r'(https?://\\S+)(?=\")'\n",
    "print(re.findall(link, html_code))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Запрос XPath\n",
    "Язык запросов XPath (XML Path Language) позволяет извлекать данные из определенных узлов XML-документа. Для работы с HTML кодом в Python используют модуль etree:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "https://www.iana.org/domains/example\n"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "from lxml import etree\n",
    "url = 'http://example.com/'\n",
    "page = urlopen(url)\n",
    "html_code = page.read().decode('utf-8')\n",
    "tree = etree.HTML(html_code)\n",
    "print(tree.xpath(\"/html/body/div/p[2]/a/@href\")[0])"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Чтобы узнать путь к нужному элементу страницы, в браузерах Chrome и FireFox надо кликнуть правой кнопкой по элементу и выбрать «Просмотреть код», после чего откроется консоль. В консоли по интересующему элементу нужно еще раз кликнуть правой кнопкой, выбрать «Копировать», а затем – копировать путь XPath:\n",
    "![](https://media.proglib.io/posts/2023/03/08/1c7d748f356f2bc01980724eb8a754c1.png)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "В приведенном выше примере для извлечения ссылки к пути `/html/body/div/p[2]/a/` мы добавили указание для получения значения ссылки `@href`, и индекс `[0]`, поскольку результат возвращается в виде списка. Если `@href` заменить на `text()`, программа вернет текст ссылки, а не сам URL:\n",
    "```python\n",
    "print(tree.xpath(\"/html/body/div/p[2]/a/text()\")[0])\n",
    "```"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Парсинг с применением BeautiulSoup\n",
    "Библиоеку необходимо установить\n",
    "```\n",
    "pip install beautifulsoup4\n",
    "```\n",
    "В приведенном ниже примере мы будем извлекать из исходного кода страницы уникальные ссылки, за исключением внутренних:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "ename": "URLError",
     "evalue": "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mSSLCertVerificationError\u001b[0m                  Traceback (most recent call last)",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:1348\u001b[0m, in \u001b[0;36mAbstractHTTPHandler.do_open\u001b[0;34m(self, http_class, req, **http_conn_args)\u001b[0m\n\u001b[1;32m   1347\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m-> 1348\u001b[0m     h\u001b[39m.\u001b[39;49mrequest(req\u001b[39m.\u001b[39;49mget_method(), req\u001b[39m.\u001b[39;49mselector, req\u001b[39m.\u001b[39;49mdata, headers,\n\u001b[1;32m   1349\u001b[0m               encode_chunked\u001b[39m=\u001b[39;49mreq\u001b[39m.\u001b[39;49mhas_header(\u001b[39m'\u001b[39;49m\u001b[39mTransfer-encoding\u001b[39;49m\u001b[39m'\u001b[39;49m))\n\u001b[1;32m   1350\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mOSError\u001b[39;00m \u001b[39mas\u001b[39;00m err: \u001b[39m# timeout error\u001b[39;00m\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1282\u001b[0m, in \u001b[0;36mHTTPConnection.request\u001b[0;34m(self, method, url, body, headers, encode_chunked)\u001b[0m\n\u001b[1;32m   1281\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Send a complete request to the server.\"\"\"\u001b[39;00m\n\u001b[0;32m-> 1282\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_send_request(method, url, body, headers, encode_chunked)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1328\u001b[0m, in \u001b[0;36mHTTPConnection._send_request\u001b[0;34m(self, method, url, body, headers, encode_chunked)\u001b[0m\n\u001b[1;32m   1327\u001b[0m     body \u001b[39m=\u001b[39m _encode(body, \u001b[39m'\u001b[39m\u001b[39mbody\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[0;32m-> 1328\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mendheaders(body, encode_chunked\u001b[39m=\u001b[39;49mencode_chunked)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1277\u001b[0m, in \u001b[0;36mHTTPConnection.endheaders\u001b[0;34m(self, message_body, encode_chunked)\u001b[0m\n\u001b[1;32m   1276\u001b[0m     \u001b[39mraise\u001b[39;00m CannotSendHeader()\n\u001b[0;32m-> 1277\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_send_output(message_body, encode_chunked\u001b[39m=\u001b[39;49mencode_chunked)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1037\u001b[0m, in \u001b[0;36mHTTPConnection._send_output\u001b[0;34m(self, message_body, encode_chunked)\u001b[0m\n\u001b[1;32m   1036\u001b[0m \u001b[39mdel\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_buffer[:]\n\u001b[0;32m-> 1037\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49msend(msg)\n\u001b[1;32m   1039\u001b[0m \u001b[39mif\u001b[39;00m message_body \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m   1040\u001b[0m \n\u001b[1;32m   1041\u001b[0m     \u001b[39m# create a consistent interface to message_body\u001b[39;00m\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:975\u001b[0m, in \u001b[0;36mHTTPConnection.send\u001b[0;34m(self, data)\u001b[0m\n\u001b[1;32m    974\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mauto_open:\n\u001b[0;32m--> 975\u001b[0m     \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mconnect()\n\u001b[1;32m    976\u001b[0m \u001b[39melse\u001b[39;00m:\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1454\u001b[0m, in \u001b[0;36mHTTPSConnection.connect\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m   1452\u001b[0m     server_hostname \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mhost\n\u001b[0;32m-> 1454\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msock \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_context\u001b[39m.\u001b[39;49mwrap_socket(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49msock,\n\u001b[1;32m   1455\u001b[0m                                       server_hostname\u001b[39m=\u001b[39;49mserver_hostname)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py:513\u001b[0m, in \u001b[0;36mSSLContext.wrap_socket\u001b[0;34m(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)\u001b[0m\n\u001b[1;32m    507\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mwrap_socket\u001b[39m(\u001b[39mself\u001b[39m, sock, server_side\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m,\n\u001b[1;32m    508\u001b[0m                 do_handshake_on_connect\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m,\n\u001b[1;32m    509\u001b[0m                 suppress_ragged_eofs\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m,\n\u001b[1;32m    510\u001b[0m                 server_hostname\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, session\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m):\n\u001b[1;32m    511\u001b[0m     \u001b[39m# SSLSocket class handles server_hostname encoding before it calls\u001b[39;00m\n\u001b[1;32m    512\u001b[0m     \u001b[39m# ctx._wrap_socket()\u001b[39;00m\n\u001b[0;32m--> 513\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49msslsocket_class\u001b[39m.\u001b[39;49m_create(\n\u001b[1;32m    514\u001b[0m         sock\u001b[39m=\u001b[39;49msock,\n\u001b[1;32m    515\u001b[0m         server_side\u001b[39m=\u001b[39;49mserver_side,\n\u001b[1;32m    516\u001b[0m         do_handshake_on_connect\u001b[39m=\u001b[39;49mdo_handshake_on_connect,\n\u001b[1;32m    517\u001b[0m         suppress_ragged_eofs\u001b[39m=\u001b[39;49msuppress_ragged_eofs,\n\u001b[1;32m    518\u001b[0m         server_hostname\u001b[39m=\u001b[39;49mserver_hostname,\n\u001b[1;32m    519\u001b[0m         context\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m,\n\u001b[1;32m    520\u001b[0m         session\u001b[39m=\u001b[39;49msession\n\u001b[1;32m    521\u001b[0m     )\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py:1071\u001b[0m, in \u001b[0;36mSSLSocket._create\u001b[0;34m(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)\u001b[0m\n\u001b[1;32m   1070\u001b[0m             \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mdo_handshake_on_connect should not be specified for non-blocking sockets\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[0;32m-> 1071\u001b[0m         \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mdo_handshake()\n\u001b[1;32m   1072\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mOSError\u001b[39;00m, \u001b[39mValueError\u001b[39;00m):\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py:1342\u001b[0m, in \u001b[0;36mSSLSocket.do_handshake\u001b[0;34m(self, block)\u001b[0m\n\u001b[1;32m   1341\u001b[0m         \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msettimeout(\u001b[39mNone\u001b[39;00m)\n\u001b[0;32m-> 1342\u001b[0m     \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_sslobj\u001b[39m.\u001b[39;49mdo_handshake()\n\u001b[1;32m   1343\u001b[0m \u001b[39mfinally\u001b[39;00m:\n",
      "\u001b[0;31mSSLCertVerificationError\u001b[0m: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)",
      "\nDuring handling of the above exception, another exception occurred:\n",
      "\u001b[0;31mURLError\u001b[0m                                  Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[12], line 4\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39murllib\u001b[39;00m\u001b[39m.\u001b[39;00m\u001b[39mrequest\u001b[39;00m \u001b[39mimport\u001b[39;00m urlopen\n\u001b[1;32m      3\u001b[0m url \u001b[39m=\u001b[39m \u001b[39m'\u001b[39m\u001b[39mhttps://webscraper.io/test-sites/e-commerce/allinone/phones\u001b[39m\u001b[39m'\u001b[39m\n\u001b[0;32m----> 4\u001b[0m page \u001b[39m=\u001b[39m urlopen(url)\n\u001b[1;32m      5\u001b[0m html \u001b[39m=\u001b[39m page\u001b[39m.\u001b[39mread()\u001b[39m.\u001b[39mdecode(\u001b[39m'\u001b[39m\u001b[39mutf-8\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m      6\u001b[0m soup \u001b[39m=\u001b[39m BeautifulSoup(html, \u001b[39m'\u001b[39m\u001b[39mhtml.parser\u001b[39m\u001b[39m'\u001b[39m)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:216\u001b[0m, in \u001b[0;36murlopen\u001b[0;34m(url, data, timeout, cafile, capath, cadefault, context)\u001b[0m\n\u001b[1;32m    214\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m    215\u001b[0m     opener \u001b[39m=\u001b[39m _opener\n\u001b[0;32m--> 216\u001b[0m \u001b[39mreturn\u001b[39;00m opener\u001b[39m.\u001b[39;49mopen(url, data, timeout)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:519\u001b[0m, in \u001b[0;36mOpenerDirector.open\u001b[0;34m(self, fullurl, data, timeout)\u001b[0m\n\u001b[1;32m    516\u001b[0m     req \u001b[39m=\u001b[39m meth(req)\n\u001b[1;32m    518\u001b[0m sys\u001b[39m.\u001b[39maudit(\u001b[39m'\u001b[39m\u001b[39murllib.Request\u001b[39m\u001b[39m'\u001b[39m, req\u001b[39m.\u001b[39mfull_url, req\u001b[39m.\u001b[39mdata, req\u001b[39m.\u001b[39mheaders, req\u001b[39m.\u001b[39mget_method())\n\u001b[0;32m--> 519\u001b[0m response \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_open(req, data)\n\u001b[1;32m    521\u001b[0m \u001b[39m# post-process response\u001b[39;00m\n\u001b[1;32m    522\u001b[0m meth_name \u001b[39m=\u001b[39m protocol\u001b[39m+\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m_response\u001b[39m\u001b[39m\"\u001b[39m\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:536\u001b[0m, in \u001b[0;36mOpenerDirector._open\u001b[0;34m(self, req, data)\u001b[0m\n\u001b[1;32m    533\u001b[0m     \u001b[39mreturn\u001b[39;00m result\n\u001b[1;32m    535\u001b[0m protocol \u001b[39m=\u001b[39m req\u001b[39m.\u001b[39mtype\n\u001b[0;32m--> 536\u001b[0m result \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call_chain(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mhandle_open, protocol, protocol \u001b[39m+\u001b[39;49m\n\u001b[1;32m    537\u001b[0m                           \u001b[39m'\u001b[39;49m\u001b[39m_open\u001b[39;49m\u001b[39m'\u001b[39;49m, req)\n\u001b[1;32m    538\u001b[0m \u001b[39mif\u001b[39;00m result:\n\u001b[1;32m    539\u001b[0m     \u001b[39mreturn\u001b[39;00m result\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:496\u001b[0m, in \u001b[0;36mOpenerDirector._call_chain\u001b[0;34m(self, chain, kind, meth_name, *args)\u001b[0m\n\u001b[1;32m    494\u001b[0m \u001b[39mfor\u001b[39;00m handler \u001b[39min\u001b[39;00m handlers:\n\u001b[1;32m    495\u001b[0m     func \u001b[39m=\u001b[39m \u001b[39mgetattr\u001b[39m(handler, meth_name)\n\u001b[0;32m--> 496\u001b[0m     result \u001b[39m=\u001b[39m func(\u001b[39m*\u001b[39;49margs)\n\u001b[1;32m    497\u001b[0m     \u001b[39mif\u001b[39;00m result \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m    498\u001b[0m         \u001b[39mreturn\u001b[39;00m result\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:1391\u001b[0m, in \u001b[0;36mHTTPSHandler.https_open\u001b[0;34m(self, req)\u001b[0m\n\u001b[1;32m   1390\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mhttps_open\u001b[39m(\u001b[39mself\u001b[39m, req):\n\u001b[0;32m-> 1391\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mdo_open(http\u001b[39m.\u001b[39;49mclient\u001b[39m.\u001b[39;49mHTTPSConnection, req,\n\u001b[1;32m   1392\u001b[0m         context\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_context, check_hostname\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_check_hostname)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:1351\u001b[0m, in \u001b[0;36mAbstractHTTPHandler.do_open\u001b[0;34m(self, http_class, req, **http_conn_args)\u001b[0m\n\u001b[1;32m   1348\u001b[0m         h\u001b[39m.\u001b[39mrequest(req\u001b[39m.\u001b[39mget_method(), req\u001b[39m.\u001b[39mselector, req\u001b[39m.\u001b[39mdata, headers,\n\u001b[1;32m   1349\u001b[0m                   encode_chunked\u001b[39m=\u001b[39mreq\u001b[39m.\u001b[39mhas_header(\u001b[39m'\u001b[39m\u001b[39mTransfer-encoding\u001b[39m\u001b[39m'\u001b[39m))\n\u001b[1;32m   1350\u001b[0m     \u001b[39mexcept\u001b[39;00m \u001b[39mOSError\u001b[39;00m \u001b[39mas\u001b[39;00m err: \u001b[39m# timeout error\u001b[39;00m\n\u001b[0;32m-> 1351\u001b[0m         \u001b[39mraise\u001b[39;00m URLError(err)\n\u001b[1;32m   1352\u001b[0m     r \u001b[39m=\u001b[39m h\u001b[39m.\u001b[39mgetresponse()\n\u001b[1;32m   1353\u001b[0m \u001b[39mexcept\u001b[39;00m:\n",
      "\u001b[0;31mURLError\u001b[0m: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>"
     ]
    }
   ],
   "source": [
    "from bs4 import BeautifulSoup\n",
    "from urllib.request import urlopen\n",
    "url = 'https://webscraper.io/test-sites/e-commerce/allinone/phones'\n",
    "page = urlopen(url)\n",
    "html = page.read().decode('utf-8')\n",
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "links = set()\n",
    "for link in soup.find_all('a'):\n",
    "    l = link.get('href')\n",
    "    if l != None and l.startswith('https'):\n",
    "        links.add(l)\n",
    "for link in links:\n",
    "    print(link)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "При использовании XPath точно такой же результат даст следующий скрипт:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [
    {
     "ename": "URLError",
     "evalue": "<urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mSSLCertVerificationError\u001b[0m                  Traceback (most recent call last)",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:1348\u001b[0m, in \u001b[0;36mAbstractHTTPHandler.do_open\u001b[0;34m(self, http_class, req, **http_conn_args)\u001b[0m\n\u001b[1;32m   1347\u001b[0m \u001b[39mtry\u001b[39;00m:\n\u001b[0;32m-> 1348\u001b[0m     h\u001b[39m.\u001b[39;49mrequest(req\u001b[39m.\u001b[39;49mget_method(), req\u001b[39m.\u001b[39;49mselector, req\u001b[39m.\u001b[39;49mdata, headers,\n\u001b[1;32m   1349\u001b[0m               encode_chunked\u001b[39m=\u001b[39;49mreq\u001b[39m.\u001b[39;49mhas_header(\u001b[39m'\u001b[39;49m\u001b[39mTransfer-encoding\u001b[39;49m\u001b[39m'\u001b[39;49m))\n\u001b[1;32m   1350\u001b[0m \u001b[39mexcept\u001b[39;00m \u001b[39mOSError\u001b[39;00m \u001b[39mas\u001b[39;00m err: \u001b[39m# timeout error\u001b[39;00m\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1282\u001b[0m, in \u001b[0;36mHTTPConnection.request\u001b[0;34m(self, method, url, body, headers, encode_chunked)\u001b[0m\n\u001b[1;32m   1281\u001b[0m \u001b[39m\u001b[39m\u001b[39m\"\"\"Send a complete request to the server.\"\"\"\u001b[39;00m\n\u001b[0;32m-> 1282\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_send_request(method, url, body, headers, encode_chunked)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1328\u001b[0m, in \u001b[0;36mHTTPConnection._send_request\u001b[0;34m(self, method, url, body, headers, encode_chunked)\u001b[0m\n\u001b[1;32m   1327\u001b[0m     body \u001b[39m=\u001b[39m _encode(body, \u001b[39m'\u001b[39m\u001b[39mbody\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[0;32m-> 1328\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mendheaders(body, encode_chunked\u001b[39m=\u001b[39;49mencode_chunked)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1277\u001b[0m, in \u001b[0;36mHTTPConnection.endheaders\u001b[0;34m(self, message_body, encode_chunked)\u001b[0m\n\u001b[1;32m   1276\u001b[0m     \u001b[39mraise\u001b[39;00m CannotSendHeader()\n\u001b[0;32m-> 1277\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_send_output(message_body, encode_chunked\u001b[39m=\u001b[39;49mencode_chunked)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1037\u001b[0m, in \u001b[0;36mHTTPConnection._send_output\u001b[0;34m(self, message_body, encode_chunked)\u001b[0m\n\u001b[1;32m   1036\u001b[0m \u001b[39mdel\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39m_buffer[:]\n\u001b[0;32m-> 1037\u001b[0m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49msend(msg)\n\u001b[1;32m   1039\u001b[0m \u001b[39mif\u001b[39;00m message_body \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m   1040\u001b[0m \n\u001b[1;32m   1041\u001b[0m     \u001b[39m# create a consistent interface to message_body\u001b[39;00m\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:975\u001b[0m, in \u001b[0;36mHTTPConnection.send\u001b[0;34m(self, data)\u001b[0m\n\u001b[1;32m    974\u001b[0m \u001b[39mif\u001b[39;00m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mauto_open:\n\u001b[0;32m--> 975\u001b[0m     \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mconnect()\n\u001b[1;32m    976\u001b[0m \u001b[39melse\u001b[39;00m:\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/http/client.py:1454\u001b[0m, in \u001b[0;36mHTTPSConnection.connect\u001b[0;34m(self)\u001b[0m\n\u001b[1;32m   1452\u001b[0m     server_hostname \u001b[39m=\u001b[39m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39mhost\n\u001b[0;32m-> 1454\u001b[0m \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msock \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_context\u001b[39m.\u001b[39;49mwrap_socket(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49msock,\n\u001b[1;32m   1455\u001b[0m                                       server_hostname\u001b[39m=\u001b[39;49mserver_hostname)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py:513\u001b[0m, in \u001b[0;36mSSLContext.wrap_socket\u001b[0;34m(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)\u001b[0m\n\u001b[1;32m    507\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mwrap_socket\u001b[39m(\u001b[39mself\u001b[39m, sock, server_side\u001b[39m=\u001b[39m\u001b[39mFalse\u001b[39;00m,\n\u001b[1;32m    508\u001b[0m                 do_handshake_on_connect\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m,\n\u001b[1;32m    509\u001b[0m                 suppress_ragged_eofs\u001b[39m=\u001b[39m\u001b[39mTrue\u001b[39;00m,\n\u001b[1;32m    510\u001b[0m                 server_hostname\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m, session\u001b[39m=\u001b[39m\u001b[39mNone\u001b[39;00m):\n\u001b[1;32m    511\u001b[0m     \u001b[39m# SSLSocket class handles server_hostname encoding before it calls\u001b[39;00m\n\u001b[1;32m    512\u001b[0m     \u001b[39m# ctx._wrap_socket()\u001b[39;00m\n\u001b[0;32m--> 513\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49msslsocket_class\u001b[39m.\u001b[39;49m_create(\n\u001b[1;32m    514\u001b[0m         sock\u001b[39m=\u001b[39;49msock,\n\u001b[1;32m    515\u001b[0m         server_side\u001b[39m=\u001b[39;49mserver_side,\n\u001b[1;32m    516\u001b[0m         do_handshake_on_connect\u001b[39m=\u001b[39;49mdo_handshake_on_connect,\n\u001b[1;32m    517\u001b[0m         suppress_ragged_eofs\u001b[39m=\u001b[39;49msuppress_ragged_eofs,\n\u001b[1;32m    518\u001b[0m         server_hostname\u001b[39m=\u001b[39;49mserver_hostname,\n\u001b[1;32m    519\u001b[0m         context\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m,\n\u001b[1;32m    520\u001b[0m         session\u001b[39m=\u001b[39;49msession\n\u001b[1;32m    521\u001b[0m     )\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py:1071\u001b[0m, in \u001b[0;36mSSLSocket._create\u001b[0;34m(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)\u001b[0m\n\u001b[1;32m   1070\u001b[0m             \u001b[39mraise\u001b[39;00m \u001b[39mValueError\u001b[39;00m(\u001b[39m\"\u001b[39m\u001b[39mdo_handshake_on_connect should not be specified for non-blocking sockets\u001b[39m\u001b[39m\"\u001b[39m)\n\u001b[0;32m-> 1071\u001b[0m         \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mdo_handshake()\n\u001b[1;32m   1072\u001b[0m \u001b[39mexcept\u001b[39;00m (\u001b[39mOSError\u001b[39;00m, \u001b[39mValueError\u001b[39;00m):\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/ssl.py:1342\u001b[0m, in \u001b[0;36mSSLSocket.do_handshake\u001b[0;34m(self, block)\u001b[0m\n\u001b[1;32m   1341\u001b[0m         \u001b[39mself\u001b[39m\u001b[39m.\u001b[39msettimeout(\u001b[39mNone\u001b[39;00m)\n\u001b[0;32m-> 1342\u001b[0m     \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_sslobj\u001b[39m.\u001b[39;49mdo_handshake()\n\u001b[1;32m   1343\u001b[0m \u001b[39mfinally\u001b[39;00m:\n",
      "\u001b[0;31mSSLCertVerificationError\u001b[0m: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)",
      "\nDuring handling of the above exception, another exception occurred:\n",
      "\u001b[0;31mURLError\u001b[0m                                  Traceback (most recent call last)",
      "Cell \u001b[0;32mIn[7], line 4\u001b[0m\n\u001b[1;32m      2\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mlxml\u001b[39;00m \u001b[39mimport\u001b[39;00m etree\n\u001b[1;32m      3\u001b[0m url \u001b[39m=\u001b[39m \u001b[39m'\u001b[39m\u001b[39mhttps://webscraper.io/test-sites/e-commerce/allinone/phones\u001b[39m\u001b[39m'\u001b[39m\n\u001b[0;32m----> 4\u001b[0m page \u001b[39m=\u001b[39m urlopen(url)\n\u001b[1;32m      5\u001b[0m html_code \u001b[39m=\u001b[39m page\u001b[39m.\u001b[39mread()\u001b[39m.\u001b[39mdecode(\u001b[39m'\u001b[39m\u001b[39mutf-8\u001b[39m\u001b[39m'\u001b[39m)\n\u001b[1;32m      6\u001b[0m tree \u001b[39m=\u001b[39m etree\u001b[39m.\u001b[39mHTML(html_code)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:216\u001b[0m, in \u001b[0;36murlopen\u001b[0;34m(url, data, timeout, cafile, capath, cadefault, context)\u001b[0m\n\u001b[1;32m    214\u001b[0m \u001b[39melse\u001b[39;00m:\n\u001b[1;32m    215\u001b[0m     opener \u001b[39m=\u001b[39m _opener\n\u001b[0;32m--> 216\u001b[0m \u001b[39mreturn\u001b[39;00m opener\u001b[39m.\u001b[39;49mopen(url, data, timeout)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:519\u001b[0m, in \u001b[0;36mOpenerDirector.open\u001b[0;34m(self, fullurl, data, timeout)\u001b[0m\n\u001b[1;32m    516\u001b[0m     req \u001b[39m=\u001b[39m meth(req)\n\u001b[1;32m    518\u001b[0m sys\u001b[39m.\u001b[39maudit(\u001b[39m'\u001b[39m\u001b[39murllib.Request\u001b[39m\u001b[39m'\u001b[39m, req\u001b[39m.\u001b[39mfull_url, req\u001b[39m.\u001b[39mdata, req\u001b[39m.\u001b[39mheaders, req\u001b[39m.\u001b[39mget_method())\n\u001b[0;32m--> 519\u001b[0m response \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_open(req, data)\n\u001b[1;32m    521\u001b[0m \u001b[39m# post-process response\u001b[39;00m\n\u001b[1;32m    522\u001b[0m meth_name \u001b[39m=\u001b[39m protocol\u001b[39m+\u001b[39m\u001b[39m\"\u001b[39m\u001b[39m_response\u001b[39m\u001b[39m\"\u001b[39m\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:536\u001b[0m, in \u001b[0;36mOpenerDirector._open\u001b[0;34m(self, req, data)\u001b[0m\n\u001b[1;32m    533\u001b[0m     \u001b[39mreturn\u001b[39;00m result\n\u001b[1;32m    535\u001b[0m protocol \u001b[39m=\u001b[39m req\u001b[39m.\u001b[39mtype\n\u001b[0;32m--> 536\u001b[0m result \u001b[39m=\u001b[39m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_call_chain(\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mhandle_open, protocol, protocol \u001b[39m+\u001b[39;49m\n\u001b[1;32m    537\u001b[0m                           \u001b[39m'\u001b[39;49m\u001b[39m_open\u001b[39;49m\u001b[39m'\u001b[39;49m, req)\n\u001b[1;32m    538\u001b[0m \u001b[39mif\u001b[39;00m result:\n\u001b[1;32m    539\u001b[0m     \u001b[39mreturn\u001b[39;00m result\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:496\u001b[0m, in \u001b[0;36mOpenerDirector._call_chain\u001b[0;34m(self, chain, kind, meth_name, *args)\u001b[0m\n\u001b[1;32m    494\u001b[0m \u001b[39mfor\u001b[39;00m handler \u001b[39min\u001b[39;00m handlers:\n\u001b[1;32m    495\u001b[0m     func \u001b[39m=\u001b[39m \u001b[39mgetattr\u001b[39m(handler, meth_name)\n\u001b[0;32m--> 496\u001b[0m     result \u001b[39m=\u001b[39m func(\u001b[39m*\u001b[39;49margs)\n\u001b[1;32m    497\u001b[0m     \u001b[39mif\u001b[39;00m result \u001b[39mis\u001b[39;00m \u001b[39mnot\u001b[39;00m \u001b[39mNone\u001b[39;00m:\n\u001b[1;32m    498\u001b[0m         \u001b[39mreturn\u001b[39;00m result\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:1391\u001b[0m, in \u001b[0;36mHTTPSHandler.https_open\u001b[0;34m(self, req)\u001b[0m\n\u001b[1;32m   1390\u001b[0m \u001b[39mdef\u001b[39;00m \u001b[39mhttps_open\u001b[39m(\u001b[39mself\u001b[39m, req):\n\u001b[0;32m-> 1391\u001b[0m     \u001b[39mreturn\u001b[39;00m \u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49mdo_open(http\u001b[39m.\u001b[39;49mclient\u001b[39m.\u001b[39;49mHTTPSConnection, req,\n\u001b[1;32m   1392\u001b[0m         context\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_context, check_hostname\u001b[39m=\u001b[39;49m\u001b[39mself\u001b[39;49m\u001b[39m.\u001b[39;49m_check_hostname)\n",
      "File \u001b[0;32m/Library/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:1351\u001b[0m, in \u001b[0;36mAbstractHTTPHandler.do_open\u001b[0;34m(self, http_class, req, **http_conn_args)\u001b[0m\n\u001b[1;32m   1348\u001b[0m         h\u001b[39m.\u001b[39mrequest(req\u001b[39m.\u001b[39mget_method(), req\u001b[39m.\u001b[39mselector, req\u001b[39m.\u001b[39mdata, headers,\n\u001b[1;32m   1349\u001b[0m                   encode_chunked\u001b[39m=\u001b[39mreq\u001b[39m.\u001b[39mhas_header(\u001b[39m'\u001b[39m\u001b[39mTransfer-encoding\u001b[39m\u001b[39m'\u001b[39m))\n\u001b[1;32m   1350\u001b[0m     \u001b[39mexcept\u001b[39;00m \u001b[39mOSError\u001b[39;00m \u001b[39mas\u001b[39;00m err: \u001b[39m# timeout error\u001b[39;00m\n\u001b[0;32m-> 1351\u001b[0m         \u001b[39mraise\u001b[39;00m URLError(err)\n\u001b[1;32m   1352\u001b[0m     r \u001b[39m=\u001b[39m h\u001b[39m.\u001b[39mgetresponse()\n\u001b[1;32m   1353\u001b[0m \u001b[39mexcept\u001b[39;00m:\n",
      "\u001b[0;31mURLError\u001b[0m: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)>"
     ]
    }
   ],
   "source": [
    "from urllib.request import urlopen\n",
    "from lxml import etree\n",
    "url = 'https://webscraper.io/test-sites/e-commerce/allinone/phones'\n",
    "page = urlopen(url)\n",
    "html_code = page.read().decode('utf-8')\n",
    "tree = etree.HTML(html_code)\n",
    "sp = tree.xpath(\"//li/a/@href\")\n",
    "links = set()\n",
    "for link in sp:\n",
    "    if link.startswith('http'):\n",
    "        links.add(link)\n",
    "for link in links:\n",
    "    print(link)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Имитация действий пользователя в браузере\n",
    "При скрапинге сайтов очень часто требуется авторизация, нажатие кнопок «Читать дальше», переход по ссылкам, отправка форм, прокручивание ленты и так далее. Отсюда возникает необходимость имитации действий пользователя. Как правило, для этих целей используют `Selenium`, однако есть и более легкое решение – библиотека `MechanicalSoup`:\n",
    "```\n",
    "pip install MechanicalSoup\n",
    "```\n",
    "По сути, MechanicalSoup исполняет роль браузера без графического интерфейса. Помимо имитации нужного взаимодействия с элементами страниц, MechanicalSoup также парсит HTML-код, используя для этого все функции BeautifulSoup.\n",
    "\n",
    "Воспользуемся тестовым сайтом http://httpbin.org/, на котором есть возможность отправки формы заказа пиццы:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import mechanicalsoup\n",
    "browser = mechanicalsoup.StatefulBrowser()\n",
    "browser.open(\"http://httpbin.org/\")\n",
    "browser.follow_link(\"forms\")\n",
    "browser.select_form('form[action=\"/post\"]')\n",
    "print(browser.form.print_summary())"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "В приведенном выше примере браузер MechanicalSoup перешел по внутренней ссылке http://httpbin.org/forms/post и вернул описание полей ввода.\n",
    "Перейдем к имитации заполнения формы:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "browser[\"custname\"] = \"Best Customer\"\n",
    "browser[\"custtel\"] = \"+7 916 123 45 67\"\n",
    "browser[\"custemail\"] = \"trex@example.com\"\n",
    "browser[\"size\"] = \"large\"\n",
    "browser[\"topping\"] = (\"cheese\", \"mushroom\")\n",
    "browser[\"comments\"] = \"Add more cheese, plz. More than the last time!\""
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Теперь форму можно отправить:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "response = browser.submit_selected()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Результат можно вывести с помощью \n",
    "print(response.text)"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Скрапинг и парсинг динамического контента\n",
    "\n",
    "Все примеры, которые мы рассмотрели выше, отлично работают на статических страницах. Однако на множестве платформ используется динамический подход к генерации и загрузке контента – к примеру, для просмотра всех доступных товаров в онлайн-магазине страницу нужно не только открыть, но и прокрутить до футера. Для работы с динамическим контентом в Python нужно установить:\n",
    "\n",
    "- Модуль [Selenium](https://pypi.org/project/selenium/).\n",
    "- Драйвер [Selenium WebDriver](https://www.selenium.dev/documentation/webdriver/getting_started/install_drivers/) для браузера.\n",
    "Если установка прошла успешно, выполнение этого кода приведет к автоматическому открытию страницы:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from selenium import webdriver\n",
    "driver = webdriver.Chrome() # или webdriver.Firefox()\n",
    "driver.get('https://google.com')"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Бывает, что даже после установки оптимальной версии драйвера интерпретатор Python возвращает ошибку `OSError: [WinError 216] Версия \"%1\" не совместима с версией Windows`. В этом случае нужно воспользоваться модулем [webdriver-manager](https://pypi.org/project/webdriver-manager/), который самостоятельно установит подходящий драйвер для нужного браузера:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from selenium import webdriver\n",
    "from selenium.webdriver.chrome.service import Service as ChromeService\n",
    "from webdriver_manager.chrome import ChromeDriverManager\n",
    "driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))"
   ]
  },
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Имитация прокрутки страницы и парсинг данных\n",
    "В качестве примера скрапинга и парсинга динамического сайта мы воспользуемся разделом [тестового онлайн-магазина](https://webscraper.io/test-sites/e-commerce/scroll/computers/tablets). Здесь расположены карточки с информацией о планшетах. Карточки загружаются ряд за рядом при прокрутке страницы:\n",
    "\n",
    "![](https://media.proglib.io/posts/2023/03/08/8e0d31d35196e077f9c9470abf03bffd.png)\n",
    "\n",
    "Пока страница не прокручена, полный HTML-код с информацией о планшетах получить невозможно. Для имитации прокрутки мы воспользуемся скриптом `'window.scrollTo(0, document.body.scrollHeight);'`. Цены планшетов находятся в тегах h4 класса pull-right price, а названия моделей – в тексте ссылок a класса title. Готовый код выглядит так:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from selenium import webdriver\n",
    "from selenium.webdriver.chrome.service import Service as ChromeService\n",
    "from webdriver_manager.chrome import ChromeDriverManager\n",
    "import time\n",
    "from bs4 import BeautifulSoup\n",
    "\n",
    "driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager(cache_valid_range=10).install()))\n",
    "url = 'https://webscraper.io/test-sites/e-commerce/scroll/computers/tablets'\n",
    "driver.get(url)\n",
    "driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')\n",
    "time.sleep(5) \n",
    "html = driver.page_source\n",
    "soup = BeautifulSoup(html, 'html.parser')\n",
    "prices = soup.find_all('h4', class_='pull-right price')\n",
    "models = soup.find_all('a', class_='title')\n",
    "for model, price in zip(models, prices):\n",
    "    m = model.get_text()\n",
    "    p = price.get_text()\n",
    "    print(f'Планшет {m}, цена - {p}')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "virt",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.7"
  },
  "orig_nbformat": 4,
  "vscode": {
   "interpreter": {
    "hash": "6af10a1dfa69cb9a161b1b5cbe263258affe3deded5a2a66904556c11193fdb1"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
